POLS 2972Q

Quantitative Analysis in Political Science

Lecture 5 | R Packages and Getting/Loading Data

Plan for Today

  • Examine R Packages

  • Importing data from different sources

R Packages

  • R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. Today and in future lectures, we will use several R packages, including:

  • The suite of tidyverse packages: for data wrangling and data visualization

  • gapminder: for easy access to an excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country

Install R Packages

  • If these packages are not already available in your R environment, install them by typing the following two lines of code into the console of your RStudio session. Note that you can check which packages (and which versions) are installed by inspecting the Packages tab in the lower right panel of RStudio.


install.packages("tidyverse")
install.packages("gapminder")

Load R Library

  • During installation, you may be asked to select a server from which to download; any of them will work. Next, you need to load these packages into your working environment. We do this with the library() function. Run the following lines in your console.


library(tidyverse)
library(gapminder)
  • You only need to install packages once, but you need to load them each time you relaunch RStudio.

  • The tidyverse packages share common philosophies and are designed to work together. You can find out more about the packages in the tidyverse at https://www.tidyverse.org.
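
  • A minimal sketch for checking which of these packages are already installed, using base R's requireNamespace(). The object names needed and is_installed are made up for illustration, and the install step is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages used in this lecture
needed <- c("tidyverse", "gapminder")

# requireNamespace() returns TRUE if a package is installed and can be loaded
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
is_installed

# Uncomment to install only the packages that are missing:
# install.packages(needed[!is_installed])
```

  • Checking first avoids reinstalling packages you already have; you still need library() to load them in each new session.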

Exploring the Gapminder Data

To get started, run the following command to load the data and save it to a local version called gm.


gm <- gapminder::gapminder


  • This command instructs R to load some data: the gapminder data set. You should see that the environment area in the upper right hand corner of the RStudio window now lists a data set called gm that has 1704 observations of 6 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
  • We can view part of the data set by typing its name into the console.


gm
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
  • However, printing the whole dataset in the console is not that useful.

  • One advantage of RStudio is that it comes with a built-in data viewer. Click on the name gm in the Environment pane (upper right window) that lists the objects in your environment. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper left hand corner.

  • What you should see are seven columns, each row representing a different combination of country and year: the first entry in each row is simply the row number (an index we can use to access individual rows if we want), the second is the country, followed by the continent in which the country is located, the year, and the last three columns give the life expectancy, population, and gross domestic product (GDP) per capita for that country in the given year, respectively. Use the scroll bar on the right side of the Data Viewer window to examine the complete data set.

  • Note that the row numbers in the first column are not part of the data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored this data set in a kind of spreadsheet or table called a data frame.

  • You can see the dimensions of this data frame as well as the names of the variables and the first few observations by typing:


glimpse(gm)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …


or


head(gm)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
  • It is better practice to type these commands directly into your console, since they do not need to be included in your solution file.

  • We can see that there are 1704 observations and 6 variables in this dataset. The variable names are country, continent, year, lifeExp, pop, and gdpPercap.

Further Exploration

  • Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like


gm$continent


  • This command will only show the names of continents given in each row of the data set. The dollar sign basically says “go to the data frame that comes before me, and find the variable that comes after me”.

1). What command would you use to extract just the country names?

  • Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 1704 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent an ordered set of values. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, Asia follows [1], indicating that Asia is the first entry in the vector. And if [43] starts a line, then the first item on that line is the 43rd entry in the vector.
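
  • The bracketed positions can also be used to pull out individual entries. A small sketch on a toy data frame (toy, region, and score are made up for illustration; this is not the gapminder data):

```r
# Toy data frame, invented for this example
toy <- data.frame(
  region = c("North", "South", "East", "West"),
  score  = c(10, 20, 30, 40)
)

regions <- toy$region   # extract one column as a vector
length(regions)         # number of entries in the vector
regions[3]              # the third entry
toy[["score"]]          # double brackets are equivalent to toy$score
```

  • The same pattern applies to any column of gm: extract it with $, then index it with square brackets.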

Brief Introduction to Data Visualization

  • R has some powerful functions for making graphics. We can create a simple scatterplot of GDP per capita against life expectancy, with one point for each country in each year, using the command


ggplot(data = gm, aes(x = lifeExp, y = gdpPercap)) + 
  geom_point()

ggplot()

  • We use the ggplot() function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas.

  • with ggplot()

    • The first argument is always the dataset
    • Next, you provide the variables from the dataset to be assigned to aesthetic elements of the plot, e.g. the x and the y axes.
    • Finally, you add another layer, separated by a +, to specify the geometric object for the plot. Since we want a scatterplot, we use geom_point().
  • For instance, if you wanted a line graph instead, you would replace geom_point() with geom_line(). The example below first averages life expectancy across countries within each year, so that the line has one value per year.


gm1 <- gm |>
  group_by(year) |>
  summarize(avglifeExp = mean(lifeExp))

ggplot(data = gm1, aes(x = year, y = avglifeExp)) + 
  geom_line()

  • You might wonder how you are supposed to know the syntax for the ggplot function. Thankfully, R documents all of its functions extensively. To learn what a function does and its arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following in your console:


?ggplot


  • Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.

Importing Data

  • Getting and loading data can be a difficult step
  • Location, Location, Location!
  • Where do your data live?
    • In R
    • On your computer (offline)
    • On the web
      • Can be loaded via an R package
      • Can be accessed via URL
      • Can be downloaded to your computer and loaded using the file path

From R Packages

# From the "datasets" package, load in the AirPassengers data
datasets::AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
airpass <- datasets::AirPassengers
# From the "gapminder" package, load in the gapminder data
gapminder::gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
gm <- gapminder::gapminder
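
  • Not every packaged data set is a data frame. A quick sketch, using only base R functions, of how one might check what kind of object was loaded:

```r
airpass <- datasets::AirPassengers

class(airpass)       # "ts": a time series, not a data frame
length(airpass)      # 144 monthly values (12 years x 12 months)
frequency(airpass)   # 12 observations per year
```

  • Checking class() right after loading tells you which functions will work on the object; gm, by contrast, is a tibble (a kind of data frame).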

From your Computer

  • Where is R currently looking for files on your machine?
    • get working directory


getwd()


  • Where do you want R to look for files?
    • set working directory


setwd("~/Desktop")


  • As we did above, you can set the working directory by calling setwd() with your location in the console. Alternatively, you can “point and click” within RStudio to set the working directory.

  • Session -> Set Working Directory -> “Choose your option”

  • Files -> “Choose your file” -> More -> Set as Working Directory (In the lower right side window)
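
  • A small sketch of changing the working directory and then restoring it. It uses tempdir() as a stand-in folder that exists on every machine; in practice you would use your own project folder. The name old_wd is made up for illustration.

```r
# setwd() invisibly returns the previous working directory, so we can save it
old_wd <- setwd(tempdir())

getwd()        # now the temporary folder
list.files()   # files R can see there without a full path

# Return to where we started
setwd(old_wd)
```

  • Saving the old location makes it easy to undo the change, which helps when a script should not leave your session pointing somewhere unexpected.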

Getting Data into R

  • How are the data stored?
  • What is the file extension (e.g., .csv, .xlsx)?
  • Did the data import correctly?
  • Do we need to make any changes?
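
  • The questions above can be sketched as a short import check. This example writes a tiny made-up CSV to a temporary file so it runs anywhere, and it uses base R's read.csv() so that no extra packages are required; the course will use tidyverse functions for the same job. The names csv_path and dat are hypothetical.

```r
# Write a tiny, invented CSV file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,value",
             "A,2000,1.5",
             "B,2001,",
             "C,2002,3.0"), csv_path)

# Import the file
dat <- read.csv(csv_path, stringsAsFactors = FALSE)

# Did the data import correctly?
str(dat)              # column names and types
dim(dat)              # number of rows and columns
colSums(is.na(dat))   # missing values per column (the blank cell becomes NA)
```

  • If a numeric column shows up as character, or the row count differs from the original file, that is a sign that changes are needed before analysis.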

How Are The Data Stored

  • Structured
    • Quantitative
      • Spreadsheets
  • Unstructured
    • Qualitative
      • Social media posts, emails, business reports
  • Temporal
    • Stock Ticks
  • Geolocation (shapefiles)
    • Georeferenced points or areas

File Extensions

  • .csv (Comma-Separated Values)
  • .xlsx (Microsoft Excel Spreadsheet)
  • .txt (Text File (What is the delimiter?))
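
  • For a .txt file, one way to find the delimiter is to look at the first few raw lines before importing. A sketch with a made-up pipe-delimited file, using base R (txt_path and pipe_dat are hypothetical names):

```r
# Create a small, invented text file that uses "|" as its delimiter
txt_path <- tempfile(fileext = ".txt")
writeLines(c("name|score", "x|1", "y|2"), txt_path)

# Peek at the raw text to spot the delimiter
readLines(txt_path, n = 2)

# Tell R which delimiter to use when importing
pipe_dat <- read.table(txt_path, header = TRUE, sep = "|")
pipe_dat
```

  • If the wrong sep is supplied, the file typically imports as a single column, which is an easy symptom to check for.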

We Will Use…

  • Structured data from R packages or clean(ish) .csv files
  • Plain text rectangular files (or flat files)
  • Using functions from tidyverse to load the files
  • Spatial data files

Helpful Tips

  • .csv is preferred
    • If you have an Excel spreadsheet, save it as .csv
  • Look at the original data file
    • Are there blank rows or columns?
    • Is there metadata contained in the spreadsheet?
    • Is there inconsistent/messy data?
    • Do dates look like dates?
    • How much data cleaning or wrangling is required?
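
  • To check whether dates look like dates, one option is to parse them with an explicit format and see which entries fail. A sketch with made-up values (raw_dates and parsed are hypothetical names):

```r
# Invented examples: one clean date, one in a different format, one non-date
raw_dates <- c("2024-01-15", "15/01/2024", "unknown")

# Entries that do not match the stated format become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")
parsed

# Which original entries failed to parse?
raw_dates[is.na(parsed)]
```

  • Entries that come back as NA are the ones to inspect in the original file.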

Break